arXiv:l 504.07000V 1 [cs.LG] 27 Apr 2015 


1 


Accelerated nonlinear discriminant analysis 

Nikolaos Gkalelis and *Vasileios Mezaris 

^Informatics and Telematics Institute, CERTH, 6th Km Charilaou-Thermi Road, RO. BOX 60361, Thermi 
57001, Greece, Tel. ++ 30 2311257774, Fax ++ 30 2310474128, email: {gkalelis, bmezaris}@iti.gr 


TABLE I 
Nomenclature 


i, k\ j, 1; n, m 

X e R-P’ 

x? 


N- Ni-, Ni, 


Pi^ Pi,. 
nL 6 i 
Ol e I 
1 l 6l 
JlxL 6 
Xi e 
^i,j 6 

X e 

St e R 
Su, 6 I 
G IR 

Bt e E 
e ] 

Bfj e 




)L 

.L 

.L 

I Ly(. It 
.FxNi 
^FxNij 

,FxN 

FxF 

.FxF 

FxF 

^FxF 

NxN 

iNxN 

NxN 

^NxN 




7^(B) 

Af(B) 


non-zero natural numbers; |1,..., a}, a e N* 
class counters; subclass counters; observation counters 
observation in i^-dimensional real vector measurement 
n-th training observation of t-th class 
n-th training observation of j'-th subclass in class i 
estimated total sample mean 

estimated sample mean of class i and sublcass {i,j) 

total number of training observations; observations 

in r-th class; observations in j-th subclass of class i 

estimated prior of i-th class and (i, j) subclass 

vector with all elements equal to 

vector with all elements equal to zero 

vector with all elements equal to one 

matrix with all elements equal to one 

matrix of observations in class i 

matrix of observations in subclass {i,j) 

block matrix of all training observations 

total scatter matrix 

within-class scatter matrix 

between-class scatter matrix 

inter-between-subclass scatter matrix 

matrix satisfying St = XBtX^ 

matrix satisfying S™ = XBu,X^ 

matrix satislying S;, = XBi,X^ 

matrix satisfying 85^5 = XBf,g(,X^ 

element of matrix B corresponding to x”, x^ 

element of matrix B corresponding to xfj, x^j 

range space of matrix B 

null space of matrix B 


it offers a much more compact representation by projecting 
the data to a very low dimensional space while preserving the 
clust ering structure of the specific task. 


For instance, in classification tasks the final classifier in 
silfa® projection subspace deals more effectively with the curse 
of dimensionalitjQ, thus increasing the generalization perfor¬ 
mance and speed up the classifier. The latter is very important 
in big data problems. In visualization tasks it offers more 
informative representations preserving valuable information by 
the extreme dimensionality reduction capabilities even in two 
or three dimensions. 

However, the computational cost of computing the eigen- 
pairs of NDA has prohibited this technology from the 
widespread use in today’s big data problems. Recently, a 
number of methods have been proposed to speed up NDA 
techniques. Here we provide a general framework accelerating 
the generalized eigenvalue decomposition of NDA approaches. 
Moreover, we show that the combination of the proposed 
method provides improved detection rates and computational 
efficiency when combined with LSVM, over the use of LSVM 
and KSVM alone. 

H. Fundamentals of nonlinear discriminant 

ANALYSIS 


A training set in the feature space may be described using 
a block data matrix 


Abstract —In this paper, a novel nonlinear discriminant analy¬ 
sis is proposed. Experimental results show that the new method 
provides state of the art performance when comhlned with LSVM 
in terms of training time and accuracy. 

Index Terms —Feature extraction, discriminant analysis, ker¬ 
nels, support vector machines, generalized elgenprohlems. 

I. Introduction 

Nonlinear discriminant analysis (NDA) using kernel func¬ 
tions is a fundamental dimensionality reduction (DR) tech¬ 
nique with applications in several domains such as machine 
learning, information retrieval, pattern detection, visualization 
and other. The NDA optimization problem is reformulated as 
generalized eigenproblem (GEP). The mathematical treatment 
of GEP is one of the most aesthetically pleasing problems in 
all numerical algebra. This fact contributed to the increasing 
interest of NDA-based techniques. The major advantage over 
linear DA (EDA) is that it captures data nonlinearities thus 
offering more informative feature representation in most real- 
world problems. Moreover, in comparison to other popular 
component analysis (CA) techniques (e.g. PCA, ICA, LEE) 


$ = [Xi,...,Xc]eK^^'^ (1) 

Eet the block matrix X = [Xi,..., Xc] 6 represents 

an annotated training set of C classes and N observations x„, 
i e Np, n e N^, in the input space R.^, where the ith block 
Xi, contains the Ni observations x„,n e 1) C of class 
i, and K is the respective set of indices. In many real-world 
applications, class distributions are not linearly separable and 
the direct application of a linear classifier in the input space 
may provide a poor performance. Kernel-based techniques 
address this problem by utilizing a vector-valued function </>(•) 
to map observations non-linearly from the input space into 
some feature space IF C where the data are expected to 
be linearly separable im 

(/)(•) : R^ ^ J- C R^, (2) 

X d> = d>(x). (3) 

^The curse of dimensionality problem states that phenomena described 
in high dimensional spaces require much more parameters to capture their 
properties 
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The problem is then reformulated in terms of dot products, 
which in turn are replaced with kernel function evaluations 

= kn,s, (4) 

where, fc(-, •) is a Mercer kernel lUl. That is, we additionally 
require that </>(•) satisfies dUi. This allows the use of tradi¬ 
tional linear solvers in problems where classes are nonlinearly 
separable in the input space and additionally the application 
of them in the feature space is intractable (e.g. F is large or 
even infinite). Also, exploiting the kernel trick the mapping is 
intrinsic and the problem can be solved in the feature space 
without even knowing the actual mapping. 

Assuming that the block matrix representing the training set 
in the feature space is 

$= (5) 

nonlinear DA methods seek the linear transformation 'i' that 
simultaneously optimize the following criteria in the feature 
space 

argmintr(tl'^S^), argmaxtr(\I/^Sf,\Ir), (g) 

where S;, is the between-class scatter matrix and S is either 
the within-class scatter matrix S^, or the total scatter matrix 

St. 

For the reformulation of (|6]l using dot products we need 
an appropriate factorization of Sj, and S (i.e. or St). 
Starting from its solution space in is restricted to 
span($) O, m. This allows us to express each column of 
G as a linear combination of the mapped training data 

^ = #W (7) 


where W e contains the expansion coefficients. 

Substituting (l7]i in (|6]l the optimization criterion becomes 


argmintr(W^SW), argmaxtr(W^Sf,W), 

(8) 

w 

w 

where 

Sf, = 

(9) 



(10) 


St = 

(11) 


and S is replaced by S^, or St. The above matrices can 
be entirely expressed in terms of dot products. By replacing 
dot products with kernel evaluations, Sf,, S^, and St (called 
hereafter kernel scatter matrices) can be viewed as the scatter 
matrices of the Gram matrix 

K = (12) 

where each column in K is considered as a data point in R.^ 

a, 0. 

The matrix S is positive semidefinite (PSD) a (as shown 
also in Sections ?? and ??), and, thus, the optimal solution 
of ® is commonly approximated using pseudoinverse criteria 
||5l . That is, we compute the transformation matrix e R^^^ 
that maximizes a 

argmaxtr((W^SW)+W'^S6W), (13) 

w 


Considering that all kernel scatter matrices are symmetric 
PSD (SPSD), the above optimization problem is equivalent to 
solving the symmetric-semidefinite generalized eigenproblem 
(GEP) 

SbW = SWA. (14) 

where, A = diag(Ai,..., A_d), W = [-Wi,..., w^i] e R^^^ 
and {Xi,Wi) are the D nonzero eigenpairs (EP) of the SPSD 
matrix pencil (Sj,, S) with eigenvalue and eigenvector sets 

A(Sb, S) = {Xi e R;^| det(Sb — A^S) = 0; Ai > Xj ifi > j} 
g(Sb, S) = {wi e R-^|Sf,Wi = AiSwi}. 

(15) 

Simultaneous diagonalization techniques are used to solve 
the problem in case of both SPD or SPSD matrix S as 
explained in the following. Eirstly, we note that it can be easily 
shown that S; is SPSD and 0, Q 

St = Sb + S^, 

null(St) = null(Sb) fl null(Su,). 

Then according to Theorem 8.7.1 in 0 there exists a matrix 
W that simultaneously diagonalizes the above scatter matrices. 


It has been shown 

that the matrix optimizing the 

above 

criterion is given by 

W = FT 

(17) 

where, T e R^^^ is any nonsingular matrix and E e 
provides the following diagonalization 

^FxD 

r'^SbE 

— diag(l7’xr; 

(18) 

r^s^r 

= diag(0j^xr: 

(19) 

r^StT 

= IdxD 

(20) 

Db 

= diag(6r+l, ■ • ■ ,&r+s) 

(21) 

Db 

= diag(u;r+i,..., ^r+s) 

(22) 

and, 1 > br+i >,. 

. . , ^ ^ 0, 0 <C W'r-\-l 

...,< 


Wr+s < 1, bi + Wi = l, r = n - Tiu, s = D + - n, r^i = 

rank(Su,), rt = rank(St). Thus, the optimal transformation 
is provided by 

^ = #rT (23) 

Different constraints on the transformation matrix characterize 
a set of different algorithms 



= = diag(Orxr,Isxs) 

(24) 


= W^StW = lDxD 

(25) 


= W^KW = Idxd 

(26) 


where (l24li is the conventional KDA, (l25l l is the uncorre¬ 
lated kernel discriminant analysis (KUDA) and (|26] | is the 
orthogonal kernel discriminant analysis (OKDA). To this end, 
many authors exploit the special structure of the kernel scatter 
matrices and retrieve suitable factorization of them. 

We should note that all the above methods (ULDA, OLDA, 
NLDA) are equivalent when the following property holds 

rank(St) = rank(Su,) + rank(St,), (27) 

which is usually the case for high-dimensional feature vectors. 
Indeed when the above is valid r = ri,, s = 0, and 

— diag(l7..j^ xr(,; r—s) (28) 

— diag(0^j^xr(,i Irt—r—sxr*—r—s ). (29) 
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This explains why the above methods provide equivalent per¬ 
formance in experimental evaluations with high dimensional 
data. 

During classification a test sample 

A. Regularization 

One popular way to solve this problem is to apply a ridge- 
type regularization operator 

S S -f elivxAf (30) 

where e e is a regularization constant; or remove the 
null space of S^,. Then, one way to identify the optimal 
transformation is to use the inverse of S and identify the 
nonzero EPs of the pencil (S“^Sb, I). However, exploiting the 
inverse breaks the symmetry and definitness of the pencil and 
thus standard unsymmetric methods are necessary, which are 
slower and more susceptible to round off errors in comparison 
to symmetric one i). Moreover, when S = S^, the null space 
of Su, may contain significant discriminant information. 

When S is SPD then a common way to solve this problem 
is by using the Cholesky factorization and symmetric QR 
factorization (Algorithm 8.7.2 in fS]). Specifically, perform the 
Cholesky factorization of S = LL^, use the symmetric QR 
decomposition of C = to compute the matrix G 

and set W = L”^G. It can be easily shown that W provides 
the desired simultaneous diagonalization 

W'^SfeW = A, (31) 

W^SW = I, (32) 

and W^W = I. 

The above techniques may have large round-off error and 
may be time consuming. 


B. Factorization HH^ 

In in, m, the scatter matrices in the feature space are 
factorized as 



Sb = HbH^, 

(33) 


G _ -or iqrT 

(34) 


St = HtHf, 

(35) 

where, Hb 6 

H^,Ht 


Then, the problem can be reformulated using the following 

factorization 

Sb 

= $^Sb^ = KbKf, 

(36) 

s^ 


(37) 

St 

= $^St$ = KtKf, 

(38) 

where. 


Kb = Hb$ 

(39) 


II 

(40) 


Kt = Ht$. 

(41) 


Then, the GEP problem is reformulated as 

KfeK^ W = KK^ WA (42) 


where K equals to K^, or Kj. The above problem is con¬ 
nected with the computation of the generalized singular value 
decomposition (GSVD) of the matrix pair (K^,K^). That is 
the right generalized singular vectors and the square root of 
the generalized singular values of the matrix pair (K,K^) 
are the generalized eigenvectors and generalized eigenvalues 
of the matrix pencil (K^Kh, K^K) = (S^, S) (Theorem 8.7.4 
in 111)- This are computed using the complete orthogonal de¬ 
composition of [KfeKiu]^ and SVD of the derived orthogonal 
matrix H- The solution of the GEP problem in (l42l i can be 
also computed by using the SVD of Kj El. 

Differently from the above methods, in Eol, GD the cross- 
product algorithm is exploited used to exploit the above 
factorization and provide efficient solutions. Specifically, the 
EVD of H^Hf, is first computed to derive the EPs V;,, Af,. 

Then, the EVD of U^S^Uf, is computed to derive the 

_/2 

eigenvector matrix P, where Ub = Vf,Ajj . The projection 
matrix Q = Uf,P we can obtain Q^S^jQ = A^,. Then the 

_ -y /2 

final projection matrix is defined as P = QA^j . Note that 
r^SujP = I and r^SbP = A“^. The disadvantage of this 
method is that A^, may contain near zero or zero values and 
it is difficult to select the correct eigenvectors. This effect may 
be alleviated with regularization on the within and total scatter 
matrices. This will add additional time to retrieve the correct 
parameter. However, the major disadvantage is that the the 
cross-product algorithm in the first step is very susceptible to 
round off errors. 

C. Factorization KBK 

Generalized discriminant analysis (GDA) is based on the 
assumption that the data are centered in the feature space. 
Under this setting the between- and total-scatter matrices are 
factorized as follows 

Sb = KBbK, (43) 

St = KK, (44) 

where Bb e is a symmetric block diagonal matrix in 

the form 

Bb = diag(Bbi,... ,Bbc) (45) 

Bbi = —BAT-xAfi (46) 

GDA computes the SVD of K and solves an equivalent EP. 

To speed up the above computation in SRKDA first the fol¬ 
lowing EVD is computed identifying the orthonormal matrix 
V g 

BbVb = VbAb, (47) 

Next, the transformation matrix G e is identified by 

solving the following linear system for G 

KG = Vb, (48) 

It can be easily shown that the computed transformation matrix 
has the required properties 

G^SbG = Ab, 

G^StG = Idxd, 


(49) 

(50) 
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The eigenvectors of B^, can be efficiently obtained by the 
inspection of this matrix. Particularly, observing that is 
the eigenvector of a set of C orthogonal eigenvectors B;, 
is obtained as V = [vi i,..., vi.o] where 


Oat, 

1]V,- 


if * 4- j, 

if i == j. 


(51) 


with repeated eigenvalue 1. 

However, note that the vector l^r is in the null space 
of the training vector set $ and thus in the null space of 
the Gram matrix ^Iat = KIat = 0. Therefore, the all 
ones vector is the trivial solution of the linear matrix system 
and thus useless. Since 1 is a repeated eigenvalue, we pick 
Iat as our first eigenvector and use the as any other C 
orthogonal vectors in the space spanned by {vc} and the 
Gram-Schmidt process is used to orthogonalize the set of 
eigenvectors {vc|c = 1,..., C}. The derived vectors are then 
denoted as 


[10] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition 
using kernel direct discriminant analysis algorithms,” IEEE Trans. 
Neural Netw., vol. 14, no. 1, pp. 117-126, Jan. 2003. 

[11] J. Lu, K. Plataniotis, A. Venetsanopoulos, and J. Wang, “An efficient 
kernel discriminant analysis method,” Pattern Recognit., vol. 38, no. 10, 
pp. 1788-1790, Oct. 2005. 


{vc,c = 1,.. .,C'-l|vJlAr = 0,vfv* = l,vfvj = 0,i^j} 

(52) 

Note that the Gram-Schmidt process is applied to the initial 
set of eigenvectors {vd and not in any random vector in . 

Concerning the computation of the solution of the linear 
matrix system problem, this can be efficiently solved, e.g., 
using Cholesky factorization. Thus, the computational cost of 
identifying V is negligible, and is extremely fast in compari¬ 
son to the application of other nonlinear DA algorithms. 


III. Conclusions 

In this paper we have proposed an efficient generalized 
eigendecomposition framework and showed how it can be 
applied to accelerate fundamental NDA techniques. Experi¬ 
mental results showed that the combination of LSVM with 
the accelerated methods achieve state of the art performance 
in terms of both accuracy and training time performance. 
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